A Thank-You Letter

Dear Admission Officer,

Thanks for your time.

This report is a small experiment in time series analysis. I learned the mathematical theory and programming entirely on my own, without external guidance, and this document is a personal note rather than a formal paper, so please be patient with any mistakes. Besides, my original purpose was to apply the ARIMA+GARCH model in practice ("for fun") rather than to build a high-performance financial model, so little financial strategy is applied here.

I was a certified public accountant and had real exposure to business as a professional at Ernst & Young. Here, however, I simply wish to show my attempt and convey my passion, effort, dedication and determination for the program.

I hope to see you in the near future.

Regards,

MA Haoran

Introduction

The purpose is to build a time series model and predict the Hang Seng Index (HSI). First, obtain the HSI data and build the training set. Then, explore the data and identify features to inform a model. Next, find the ARIMA model with the lowest AIC and fit a GARCH model to its residuals. Finally, make a 10-day forecast and compare the outcome with the testing data.

Process Data

library(tidyquant)
library(fpp3)
library(rugarch)
library(data.table)
library(plotly)
setwd("C:/Users/Apple/Desktop/RStudio Tour/note/TSA")

The historical data is downloaded from Yahoo Finance. The training data covers 2010-01-05 to 2021-01-29.

HSI <- fread("^HSI.csv")
HSI <- HSI[, .(Date, Close = as.numeric(Close))]
HSI <- HSI %>% 
    as_tsibble(index = Date) %>%
    mutate(Return = difference(log(Close)) * 100)

HSI <- HSI %>%
    na.omit()%>%
    mutate(time = row_number()) %>%
    update_tsibble(index = time)
head(HSI)
## # A tsibble: 6 x 4 [1]
##   Date        Close Return  time
##   <date>      <dbl>  <dbl> <int>
## 1 2010-01-05 22280.  2.07      1
## 2 2010-01-06 22417.  0.613     2
## 3 2010-01-07 22269. -0.659     3
## 4 2010-01-08 22297.  0.123     4
## 5 2010-01-11 22412.  0.513     5
## 6 2010-01-12 22327. -0.379     6

Plot the Close Price

HSI %>%
    ggplot(aes(x = Date, y = Close)) +
    geom_line() +
    theme_tq() +
    labs(title = "HSI Index (date data)")

According to the plot, the close price is not stationary. To fit an ARIMA model, it is necessary to difference the logarithm of the price.

Plot the return

Return(t) = (ln(Close(t)) - ln(Close(t-1))) * 100

HSI %>%
    ggplot(aes(x = Date, y = Return)) +
    geom_line() +
    theme_tq() +
    labs(title = "HSI Return (date data)")

The Return seems stationary.

One way to determine more objectively whether differencing is required is a unit root test: a statistical hypothesis test of stationarity designed for exactly this question.

Unit Root Test (KPSS):

H0: the data is stationary

HSI%>%features(Return,unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0308         0.1

The p-value = 0.1 > 0.05, so we fail to reject the null hypothesis: the Return series can be treated as stationary.
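As a cross-check (an extra step, not part of the original analysis), the feasts package also provides unitroot_ndiffs, which reports the number of differences the KPSS test suggests:

```r
# Sketch of a cross-check: unitroot_ndiffs estimates how many differences
# are needed for stationarity; 0 would confirm that Return needs no
# further differencing.
HSI %>% features(Return, unitroot_ndiffs)
```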

Model:ARIMA+GARCH

ARIMA part

Search automatically for the ARIMA model with the lowest information criterion (fable's ARIMA() minimizes AICc by default).

arima.mod <- HSI %>%
    model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))

arima.mod
## # A mable: 1 x 1
##   `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
##                                                    <model>
## 1                                           <ARIMA(2,0,3)>

ARIMA residual analysis:

arima.mod%>%
    gg_tsresiduals()

Ljung-Box Test

H0: The residuals cannot be distinguished from white noise.

augment(arima.mod)%>%features(.innov,ljung_box)
## # A tibble: 1 x 3
##   .model                                                 lb_stat lb_pvalue
##   <chr>                                                    <dbl>     <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE)  0.0101     0.920

According to the plot and the Ljung-Box test (p-value > 0.05), the residuals of the model cannot be distinguished from white noise, so it is reasonable to retain ARIMA(2,0,3).
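To inspect the selected model's coefficients and information criteria (an extra step, not in the original), fabletools provides report() and glance():

```r
# report() prints the fitted coefficients of the chosen ARIMA model;
# glance() returns a one-row summary including sigma2, log-likelihood,
# AIC, AICc and BIC.
arima.mod %>% report()
arima.mod %>% glance()
```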

Detect the ARCH effect in the residuals:

An ARCH effect would be present if the ACF and PACF of the squared residuals are significant.

p1<-arima.mod%>%
    augment()%>%
    select(.innov)%>%
    ACF(.innov^2)%>%
    autoplot()+
    labs(y="resid^2  acf")

p2<-arima.mod%>%
    augment()%>%
    select(.innov)%>%
    PACF(.innov^2)%>%
    autoplot()+
    labs(y="resid^2  pacf")

gridExtra::grid.arrange(p1,p2)

The ARCH effect on the residuals is significant, so it is plausible to fit a GARCH model to them.
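As a formal complement to the visual check (an assumed extra step; the FinTS package is not loaded above), Engle's LM test can be applied to the residuals:

```r
# Sketch: Engle's ARCH LM test on the ARIMA innovations; a small p-value
# indicates conditional heteroskedasticity, supporting a GARCH fit.
resid <- augment(arima.mod)$.innov
FinTS::ArchTest(resid, lags = 12)
```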

GARCH part

Try model: ARIMA(2,0,3)+GARCH(1,1)

spec <-
    ugarchspec(
        variance.model = list(
            model = "sGARCH",
            garchOrder = c(1, 1),
            submodel = NULL,
            external.regressors = NULL,
            variance.targeting = FALSE
        ),
        mean.model = list(armaOrder = c(2, 3),
                          include.mean = TRUE),
        distribution.model = "sged"
    )


fit <- ugarchfit(spec, HSI$Return,
                 solver = "hybrid")

Report the final model

print(fit)
## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : sGARCH(1,1)
## Mean Model   : ARFIMA(2,0,3)
## Distribution : sged 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error   t value Pr(>|t|)
## mu      0.025299    0.018994   1.33197 0.182871
## ar1    -1.312107    0.015853 -82.76469 0.000000
## ar2    -0.440748    0.020674 -21.31875 0.000000
## ma1     1.311944    0.002183 601.05156 0.000000
## ma2     0.429959    0.029283  14.68293 0.000000
## ma3     0.006806    0.011885   0.57264 0.566887
## omega   0.020438    0.007478   2.73295 0.006277
## alpha1  0.051404    0.009379   5.48086 0.000000
## beta1   0.933010    0.012840  72.66518 0.000000
## skew    0.916142    0.021129  43.36049 0.000000
## shape   1.383882    0.053582  25.82747 0.000000
## 
## Robust Standard Errors:
##         Estimate  Std. Error    t value Pr(>|t|)
## mu      0.025299    0.019449    1.30079 0.193331
## ar1    -1.312107    0.012415 -105.69088 0.000000
## ar2    -0.440748    0.016284  -27.06573 0.000000
## ma1     1.311944    0.002901  452.27417 0.000000
## ma2     0.429959    0.016539   25.99719 0.000000
## ma3     0.006806    0.006901    0.98623 0.324022
## omega   0.020438    0.008056    2.53687 0.011185
## alpha1  0.051404    0.011494    4.47233 0.000008
## beta1   0.933010    0.014748   63.26441 0.000000
## skew    0.916142    0.022258   41.16010 0.000000
## shape   1.383882    0.055471   24.94806 0.000000
## 
## LogLikelihood : -4046.441 
## 
## Information Criteria
## ------------------------------------
##                    
## Akaike       2.9933
## Bayes        3.0173
## Shibata      2.9933
## Hannan-Quinn 3.0020
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                          statistic p-value
## Lag[1]                       3.696 0.05453
## Lag[2*(p+q)+(p+q)-1][14]     8.013 0.19343
## Lag[4*(p+q)+(p+q)-1][24]    11.670 0.59853
## d.o.f=5
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      2.893 0.08898
## Lag[2*(p+q)+(p+q)-1][5]     6.596 0.06492
## Lag[4*(p+q)+(p+q)-1][9]     8.511 0.10215
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.928 0.500 2.000  0.1650
## ARCH Lag[5]     2.589 1.440 1.667  0.3551
## ARCH Lag[7]     3.664 2.315 1.543  0.3976
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  1.5655
## Individual Statistics:              
## mu     0.10650
## ar1    0.06487
## ar2    0.06350
## ma1    0.08800
## ma2    0.08856
## ma3    0.12493
## omega  0.09098
## alpha1 0.07925
## beta1  0.07443
## skew   0.18009
## shape  0.27474
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          2.49 2.75 3.27
## Individual Statistic:     0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias           0.6568 0.51138    
## Negative Sign Bias  0.2529 0.80038    
## Positive Sign Bias  2.0783 0.03778  **
## Joint Effect       10.3007 0.01618  **
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     20.06       0.3911
## 2    30     32.83       0.2845
## 3    40     38.35       0.4995
## 4    50     56.52       0.2146
## 
## 
## Elapsed time : 4.764131

Forecast

Build a forecast function to predict the return over the next 10 days.

fore <- function(input, p, q, n) {
    # Specify an ARMA(p,q) mean model with GARCH(1,1) variance and
    # skewed generalized error (sged) innovations
    spec <-
        ugarchspec(
            variance.model = list(
                model = "sGARCH",
                garchOrder = c(1, 1),
                submodel = NULL,
                external.regressors = NULL,
                variance.targeting = FALSE
            ),
            mean.model = list(armaOrder = c(p, q),
                              include.mean = TRUE),
            distribution.model = "sged"
        )
    # Fit the model and forecast n steps ahead
    fit <- ugarchfit(spec, input$Return, solver = "hybrid")
    fore <- ugarchforecast(fit, n.ahead = n)
    # Extract the forecast series and attach an integer time index
    pred <- as.data.table(fore@forecast$seriesFor)
    setnames(pred, "Return")
    pred[, time := seq_len(n)][]
}


pred<-fore(HSI,2,3,10)

#View the predictions
pred
##           Return time
##  1:  0.154925180    1
##  2: -0.109081450    2
##  3:  0.138608698    3
##  4: -0.064147019    4
##  5:  0.092721230    5
##  6: -0.023742514    6
##  7:  0.059931009    7
##  8:  0.001473559    8
##  9:  0.041297048    9
## 10:  0.014809375   10

Examine the predictions against the testing set. The raw test data starts on 2021-01-29, the last day of the training set.

test <- fread("test.csv")
test <- test[, .(Date, Close = as.numeric(Close))]
test <- test %>% 
    as_tsibble(index = Date) %>%
    mutate(Return = difference(log(Close)) * 100)

test <- test %>%
    na.omit() %>%
    mutate(time = row_number()) %>%
    update_tsibble(index = time)

# Preview the test data
head(test,5)
## # A tsibble: 5 x 4 [1]
##   Date        Close Return  time
##   <date>      <dbl>  <dbl> <int>
## 1 2021-02-01 28893.  2.13      1
## 2 2021-02-02 29249.  1.22      2
## 3 2021-02-03 29307.  0.201     3
## 4 2021-02-04 29114. -0.664     4
## 5 2021-02-05 29289.  0.600     5

Plot the forecast values against the test values.

foretest <- function(pred, test) {
   p<- pred %>%
        tsibble(index = time) %>%
        left_join(test, by = "time") %>%
        mutate(Pred = Return.x,
               Actual = Return.y) %>%
        select(time, Date, Close, Pred, Actual) %>%
        pivot_longer(cols = c(Pred, Actual), values_to = "Return",
                     names_to="Type") %>%
        ggplot(aes(x = Date, y = Return, color = Type)) +
        geom_line() +
        scale_x_date(date_minor_breaks = "1 day")+
        theme_tq()
   ggplotly(p)
}

foretest(pred,test)

A discrepancy is easy to observe: the actual line seems to lag one step behind the prediction line, but the locations of the inflection points are fairly close. (Perhaps this is caused by market efficiency… anyway, an efficient market may be good news for society. Just kidding!)
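To quantify the discrepancy (an extra step, not in the original analysis), simple error metrics can be computed on the joined series:

```r
# Sketch: join predictions and actuals by the integer time index, then
# compute RMSE and MAE of the 10-day return forecast.
cmp <- merge(pred, as.data.table(test)[, .(time, Actual = Return)],
             by = "time")
cmp[, .(RMSE = sqrt(mean((Return - Actual)^2)),
        MAE  = mean(abs(Return - Actual)))]
```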

Of course, the model could be improved. Cross-validation would be useful, and it might be fun to use the bootstrap and the hilo() function to draw prediction intervals.
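A rolling-origin cross-validation for the ARIMA part could be sketched with tsibble's stretch_tsibble() (the .init and .step values are illustrative assumptions, and the GARCH layer is omitted here for brevity):

```r
# Sketch: re-fit ARIMA(2,0,3) on expanding training windows and score
# 10-step-ahead forecasts against the full series.
cv <- HSI %>%
    stretch_tsibble(.init = 2500, .step = 250) %>%
    model(ARIMA(Return ~ pdq(2, 0, 3))) %>%
    forecast(h = 10)
accuracy(cv, HSI)
```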

To ensure reproducibility, all code and data have been uploaded to GitHub.